A Biological Immunity-Based Neuro Prototype for Few-Shot Anomaly Detection with Character Embedding

Anomaly detection is widely applied to help people recognize fraud, intrusions, flaws, equipment failures, and similar problems. In most practical scenarios, annotated data and trusted labels are scarce, which results in poor detection performance. In this paper, we focus on anomaly detection for text-type data and propose a detection network based on biological immunity for few-shot detection, which imitates the working mechanism of the immune system of biological organisms. By embedding characters, this network enables the protected system to distinguish the aggressive behavior of “nonself” from the legitimate behavior of “self”. First, it constructs episodic task sets and extracts data representations at the character level. Then, in the pretraining phase, Word2Vec is used to embed the representations. In the meta-learning phase, a dynamic prototype containing encoder, routing, and relation modules is designed to identify the data traffic. Compared to the mean-based prototype, the proposed prototype applies a dynamic routing algorithm that assigns different weights to samples in the support set through multiple iterations to obtain a prototype that reflects the distribution of samples. The proposed method is validated on 2 real traffic datasets. The experimental results indicate that (a) the proposed anomaly detection prototype outperforms state-of-the-art few-shot techniques by 1.3% to 4.48% in accuracy and 0.18% to 4.55% in recall; (b) while maintaining accuracy and recall, the number of training samples is reduced to 5 or 10; (c) ablation experiments designed for each module show that more accurate prototypes can be obtained by using the dynamic routing algorithm.


Introduction
Text data analysis can effectively help us understand the data corpus, quickly identify potential problems in the data, and guide subsequent model training and selection. This kind of data is widely present in networks, the Internet, logs, devices, and operating systems. In order to find anomalies in the data and prevent damage to the system, anomaly detection (AD) has been used as one of the most critical systems [1]. For example, log AD refers to finding abnormal logs to determine the cause and nature of system faults. Usually, log data is modeled as a natural language sequence for AD.
Machine learning and deep learning are widely leveraged in the field of AD to improve the performance of AD systems, such as misuse-based detection [2], deception-based detection, and bio-based detection [3]. Machine learning uses algorithms to parse network data, learn the characteristics of traffic data, and then classify and predict a certain class of things. Classic machine learning models, such as random forest [4], support vector machine (SVM) [5], and AdaBoost [6], have been introduced to detect anomalies. Horng et al. [7] proposed an AD system based on SVM, in which a hierarchical clustering algorithm was used to deal with typical data. It reduced the complexity and redundancy of the dataset and further reduced the training time. Al-Yaseen et al. [8] proposed a multilevel hybrid detection model based on SVM, with the results showing an accuracy of 95.75%. In [9], a network AD system based on LightGBM was presented. It uses an oversampling technique to increase the minority samples of imbalanced training data and thereby the detection accuracy. Since the detection performance of machine-learning-based algorithms depends on the manual selection of features, plenty of professional knowledge is required to mine the deep features in the data.
Deep-learning-based algorithms learn the inherent distribution and representation levels of sample data and can automatically extract the characteristics of data such as text, images, and sounds. At the same time, the nonlinear hidden layer structure in the neural network is helpful for the learning and prediction of high-dimensional data. In this context, deep learning algorithms are gradually being considered for detection tasks, such as autoencoders [10,11], convolutional neural networks (CNNs) [12,13], and long short-term memory networks (LSTMs) [14]. For example, Min et al. [2] proposed a system that combined Text-CNN and random forest to construct an anomaly-based network detection system. Kim et al. [15] used deep learning to generate virtual samples on the dataset and proposed a malware detection method based on deep convolutional generative adversarial networks. The role of the generative adversarial network is to generate similar data so that deformations of malware are detected more accurately. The authors of [16] applied a deep belief neural network to extract data features and then used a backpropagation neural network as a classifier to identify traffic data anomalies. An AD method based on BiLSTM deep learning was proposed in [17]. Experiments showed that in binary and multiclass AD, BiLSTM not only improved the performance of traditional LSTM but also had higher detection accuracy.
An AD system based on deep learning is essentially a classifier trained on a large amount of data, highly dependent on feature engineering and dataset capacity. Data imbalance and scarce training data often appear in reality, which cause overfitting, convergence to local optima, or other problems. To address these problems, some researchers have tried to introduce few-shot learning prototypes into network AD in recent years, such as [18][19][20][21][22][23][24][25][26][27]. Yu et al. [28] used a deep neural network (DNN) and a CNN as the traffic embedding network to map each sample into a high-dimensional sample space. After that, the prototype vector of each anomaly category is obtained by averaging. Finally, the distance between the new anomaly and each prototype vector is measured to obtain the classification result. Rong et al. [25] proposed a few-shot-learning-based prototype, UMVD-FSL, to detect unseen malware variants with a small set of data. It starts with network flow data generated by malware variants and benign applications and converts them to grayscale images. A prototype-based few-shot learning model takes the grayscale images as input and leverages meta-training to generalize the meta-learner so that it adapts to new tasks. Xu et al. [18] designed a deep neural network, mainly composed of 2 parts, a feature extraction network and a comparison network, to classify network traffic samples. Guo et al. [26] integrated a global attention mechanism and aggregated the global information of inputs by capturing the byte relationships within the payload sequence. A metric-based AD prototype was proposed in [27], in which the feature extractor fuses original byte content with network flow features to improve detection precision and recall. As noted by Sung et al., RelationNet [21] builds a learnable nonlinear comparator through a neural network to calculate the distance between 2 samples and then analyzes the sample similarity, instead of using a fixed linear comparator such as Euclidean distance or cosine distance. In general, the above methods apply mean-based prototypes and measure similarity by Euclidean distance to obtain classification results. However, the nearest-neighbor classifier based on a mean prototype and a fixed linear comparator easily causes estimation bias due to the data scarcity in few-shot scenarios and ultimately affects detection precision and recall. What is more, because data parsing is a hindrance, building a universal network system to detect network traffic attacks is still a tough task for deep learning methods.
In this paper, we leverage the working mechanism of the immune system to design a neural prototype for few-shot AD with character embedding, namely CharNet, which combines text embedding techniques with improved metric-based few-shot learning and improves the accuracy and recall of existing deep models for few-shot network traffic classification. Specifically, CharNet consists of 3 phases: dataset construction, pretraining, and meta-learning. The dataset construction phase transforms the original traffic dataset into an episode-based dataset, converting the multiclassification problem into a 2-way K-shot problem. To skip the data parsing step, Word2Vec, a self-supervised learning method, is used in the pretraining phase to convert traffic into embedding vectors. The multidimensional feature data classification task is thereby transformed into a text classification task. In the meta-learning phase, we design a dynamic routing prototype network, which consists of 3 modules (encoder, routing, and relation), to identify whether the network traffic is normal or not. Compared to the mean-based prototype, CharNet uses a dynamic routing algorithm that assigns different weights to samples in the support set through multiple iterations to obtain a prototype that reflects the distribution of samples. Finally, the relation module gives the classification results by comparing the traffic embedding vectors. The main contributions of this paper are as follows:
• We propose CharNet for few-shot AD. This method uses the dynamic routing algorithm to assign weights to the samples in the support set and builds a routing-based prototype, which effectively reduces the estimation bias and sampling bias caused by a small number of samples.
• We treat network traffic as a string and use character-level coding to omit the data parsing step. This processing allows the embedding layer to be pretrained with Word2Vec, which learns prior knowledge for network traffic classification and effectively improves training speed and accuracy.
• Compared with other methods, our method achieves higher accuracy. Relying on learned prior knowledge, it can detect new types of samples in an unseen dataset on the basis of only a limited number of labels.
The rest of the paper is organized as follows. Materials and Methods reviews the work related to our method. Experiments describes the details of our proposed prototype, presents our experiments, and makes comparisons with big-data methods and other few-shot learning methods. Conclusion summarizes the paper.

Problem formulation
We consider network AD as a few-shot classifier learning task, whose purpose is to train a classifier $f_\theta(\cdot)$ with few samples $x_i$ and predict the corresponding labels $\hat{y}_i$ of new samples $\hat{x}_i$. As shown in Fig. 1, we have 3 datasets: a base set $\mathcal{D}_{base}$, a meta-training task set $\mathcal{T}_{train} = \langle \mathcal{S}_{train}, \mathcal{Q}_{train} \rangle$, and a meta-test task set $\mathcal{T}_{test} = \langle \mathcal{S}_{test}, \mathcal{Q}_{test} \rangle$. The base set has a large number of samples with a set of classes $\mathcal{C}_{base}$. If the support set of a meta task set contains K labeled samples for each of N unique classes, the target few-shot problem is called N-way K-shot. As the goal is usually to distinguish between normal samples and a particular type of malicious samples, we consider network AD as a 2-way K-shot problem.
Following the episode-based training proposed in [29], in each episode iteration, the support set $\mathcal{S}$ is formed by K labeled samples from each of the C classes, that is, $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{K \times C}$. Similarly, the query set $\mathcal{Q}$ is formed from the remainder of those C classes' samples, that is, $\mathcal{Q} = \{(x_j, y_j)\}_{j=1}^{B}$. Meanwhile, the base set with abundant labeled samples is $\mathcal{D}_{base} = \{(x_i, y_i)\}_{i=1}^{m}$. Both meta task sets contain a support set and a query set, and the support set and query set share the same label space, but the label space of the meta-training task set is disjoint from the label space of the meta-test task set. That is, $\mathcal{C}_{train} \cap \mathcal{C}_{test} = \emptyset$. Our goal is to learn a good meta-learner $f_\theta(\cdot)$ on the support set $\mathcal{S}$ based on prior knowledge obtained on the base set $\mathcal{D}_{base}$ so that it performs well on the query set $\mathcal{Q}$. In our few-shot experiments (see Results and Discussion), we consider 5-shot (K = 5) and 10-shot (K = 10) settings.

Overall prototype
As shown in Fig. 2, the proposed prototype is divided into 3 phases, including dataset construction, pretraining, and meta-learning.

Dataset construction
In this phase, we construct the base set $\mathcal{D}_{base}$, the train task set $\mathcal{T}_{train}$, and the test task set $\mathcal{T}_{test}$. We first perform preprocessing operations, such as duplicate value deletion, default value supplementation, and zero padding at the end of each traffic log up to the length of the longest log. Then, each traffic log is tokenized at the character level to extract fine-grained feature expressions. After that, in order to define each task as a 2-way K-shot task, we mix K normal samples with K malicious samples of each type.
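The character-level tokenization and zero padding described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the helper names `build_vocab` and `encode`, the reserved padding id 0, and the two sample log strings are all assumptions made for the example.

```python
def build_vocab(logs):
    """Map every character seen in the logs to an integer id; 0 is reserved for padding."""
    chars = sorted({ch for log in logs for ch in log})
    return {ch: idx + 1 for idx, ch in enumerate(chars)}


def encode(logs, vocab, pad_id=0):
    """Tokenize each log by character and zero-pad it to the longest log length."""
    max_len = max(len(log) for log in logs)
    encoded = []
    for log in logs:
        ids = [vocab[ch] for ch in log]
        encoded.append(ids + [pad_id] * (max_len - len(ids)))
    return encoded


# Hypothetical traffic logs, for illustration only.
logs = ["GET /a", "GET /ab?id=1"]
vocab = build_vocab(logs)
padded = encode(logs, vocab)
```

Because the dictionary is built from characters rather than parsed fields, no protocol-specific parser is needed before the embedding step.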
The classes $\mathcal{C}_{train}$ with a large number of samples are selected as the train task set $\mathcal{T}_{train}$, and the remaining classes $\mathcal{C}_{test}$ are used as the test task set $\mathcal{T}_{test}$. To follow the episode-based strategy, the few-shot task generation is detailed in Algorithm 1. In order to improve the training speed and accuracy, we take the support set of the train task set, $\mathcal{T}_{train}\langle\mathcal{S}\rangle$, as the base set $\mathcal{D}_{base}$ to participate in pretraining.
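As a rough illustration of how one 2-way K-shot episode could be generated, the sketch below mixes K normal samples with K samples of one malicious class and draws the queries from the remaining samples. It is an assumption about the general shape of such a procedure, not a transcription of Algorithm 1; the function name `sample_episode`, the label convention (0 = normal, 1 = malicious), and the placeholder samples are hypothetical.

```python
import random


def sample_episode(normal, malicious, k, n_query, rng):
    """Build one 2-way K-shot episode: K normal plus K malicious support samples,
    with queries drawn from the remaining samples of the same two classes."""
    support, query = [], []
    for label, pool in ((0, normal), (1, malicious)):
        picked = rng.sample(pool, k + n_query)  # sampled without replacement
        support += [(x, label) for x in picked[:k]]
        query += [(x, label) for x in picked[k:]]
    rng.shuffle(query)
    return support, query


rng = random.Random(0)
normal = [f"normal-{i}" for i in range(50)]  # placeholder samples
ddos = [f"ddos-{i}" for i in range(50)]      # one malicious class
support, query = sample_episode(normal, ddos, k=5, n_query=10, rng=rng)
```

Sampling support and query items in a single draw without replacement keeps the two sets disjoint within an episode.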

Pretraining
A Word2Vec [30] self-supervised model is trained with the samples in the base set $\mathcal{D}_{base}$. The skip-gram model, which predicts the surrounding words from the central word, is applied in our word-embedding task. We define 2 parameter matrices, $W \in \mathbb{R}^{D \times |V|}$ and $W' \in \mathbb{R}^{|V| \times D}$, where D represents the embedding dimension and can be set to any size, and |V| is the size of the dictionary set V. Since we treat network traffic as a string, we tokenize the string by characters [31] and one-hot encode it according to the dictionary V to get the sequence $u = \{u_1, u_2, \ldots, u_L\}$, where L is the length of the sequence. Skip-gram works as the following steps:
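As a rough illustration of the skip-gram computation with the matrices W and W′ defined above, the following sketch assumes the standard full-softmax formulation: look up the embedding of the central character, score every dictionary entry, and normalize with a softmax. The toy sizes, the random initialization, and the function names are assumptions; practical Word2Vec training usually replaces the full softmax with negative sampling or a hierarchical softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 6, 4                      # toy dictionary size |V| and embedding dimension D
W = rng.normal(size=(D, V))      # input (embedding) matrix, W in R^{D x |V|}
W_out = rng.normal(size=(V, D))  # output matrix, W' in R^{|V| x D}


def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec


def skipgram_probs(center_id):
    """Distribution over context characters given the central character."""
    u_c = one_hot(center_id, V)  # one-hot central character
    v_c = W @ u_c                # embedding lookup: column center_id of W
    scores = W_out @ v_c         # one score per dictionary entry
    scores -= scores.max()       # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()


def skipgram_loss(center_id, context_ids):
    """Negative log-likelihood of the observed context characters."""
    probs = skipgram_probs(center_id)
    return -sum(np.log(probs[c]) for c in context_ids)


loss = skipgram_loss(center_id=2, context_ids=[1, 3])
```

After training, the columns of W serve as the character embeddings that initialize the embedding layer.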

Meta-learning
In this phase, a meta-learner $f_\theta(\cdot)$ learns meta knowledge in $\mathcal{T}_{train}$ so that it can learn quickly and accurately with few data in $\mathcal{T}_{test}$. This phase is divided into 2 stages: meta-training and meta-test. In Dataset construction, we have constructed a train task set $\mathcal{T}_{train}$ and a test task set $\mathcal{T}_{test}$, both containing a support set $\mathcal{S}$ and a query set $\mathcal{Q}$. In Pretraining, a feature extractor $f_\varphi(\cdot)$ is trained on $\mathcal{D}_{base}$.
In the meta-training stage, based on the feature extractor $f_\varphi(\cdot)$, a meta-learner $f_\theta(\cdot)$ is trained on the support set of the train task set, $\mathcal{T}_{train}\langle\mathcal{S}\rangle$, by maximizing the likelihood on the query set of the train task set, $\mathcal{T}_{train}\langle\mathcal{Q}\rangle$. Here, $\omega$ represents the meta knowledge, and $\varphi$ and $\theta$ represent the parameters of $f_\varphi(\cdot)$ and $f_\theta(\cdot)$, respectively. The optimal meta knowledge $\omega^*$ is learned by sampling a large number of train tasks. In the meta-test stage, the optimal meta-learner parameters $\theta^*$ are found based on the optimal meta knowledge $\omega^*$ that has been learned. For evaluation, we directly predict labels in $\mathcal{T}_{test}\langle\mathcal{Q}\rangle$ with the optimal meta-learner $f_{\theta^*}(\cdot)$ and then compare them with the ground truth to evaluate the performance.

Architecture of CharNet
CharNet includes 3 modules, namely the encoder module, the routing module, and the relation module, as shown in Fig. 3 (for the 2-way 3-shot case).

Encoder module
In order to consider both the historical and the future information of the sequence, and to let the model focus on the information in the input data that is salient and relevant to the current output, we adopt a bidirectional LSTM (biLSTM) network with self-attention [32]. The encoder module receives an input sequence $x = (c_1, c_2, \ldots, c_L)$, where $c_l$ represents the character embedding and L represents the length of the sequence. The forward hidden state $\overrightarrow{h_t}$ and the reverse hidden state $\overleftarrow{h_t}$ are obtained by the biLSTM and then concatenated to obtain the hidden state $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$.
We set the dimension of each LSTM unit hidden state to u, and the set of all hidden states is $H = (h_1, h_2, \ldots, h_L)$. In this way, an input sequence x of variable dimension is transformed into a hidden state sequence H of fixed dimension. Afterwards, the self-attention mechanism is used to assign a corresponding attention score to each hidden state; it takes the whole set of hidden states H as input and outputs a vector of weights a.
Here, $W_{a1} \in \mathbb{R}^{d_a \times 2u}$ and $W_{a2} \in \mathbb{R}^{d_a}$ are weight matrices and $d_a$ is a hyperparameter. The output representation e of the encoder is the sum of the hidden states in H weighted by a:
$$e = \sum_{t=1}^{L} a_t h_t \quad (8)$$
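The attention pooling at the end of the encoder can be sketched as below. The shapes follow the text ($H$ holds L hidden states of size 2u, $W_{a1}$ is $d_a \times 2u$, $W_{a2}$ has size $d_a$), but the scoring form `softmax(W_a2 · tanh(W_a1 · Hᵀ))` is an assumption taken from the common structured self-attention formulation, chosen because it matches these shapes; the biLSTM itself is replaced by random hidden states for brevity.

```python
import numpy as np


def softmax(x):
    x = x - x.max()
    exp = np.exp(x)
    return exp / exp.sum()


def attention_pool(H, W_a1, W_a2):
    """H: (L, 2u) hidden states; W_a1: (d_a, 2u); W_a2: (d_a,).
    Returns the attention weights a (L,) and the representation e (2u,)."""
    scores = W_a2 @ np.tanh(W_a1 @ H.T)  # (L,), one score per time step
    a = softmax(scores)                  # weights sum to 1
    e = a @ H                            # weighted sum of hidden states
    return a, e


rng = np.random.default_rng(0)
L, u, d_a = 7, 3, 5
H = rng.normal(size=(L, 2 * u))          # stand-in for biLSTM outputs
a, e = attention_pool(H, rng.normal(size=(d_a, 2 * u)), rng.normal(size=d_a))
```

Whatever the input length L, the output e always has dimension 2u, which is what lets the routing module treat all samples uniformly.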

Routing module
The dynamic routing algorithm is the core of this section and is similar to the multihead attention mechanism. It assigns weights to the samples in the support set through multiple iterations, so as to obtain a prototype vector that reflects the distribution of the samples. We regard the vectors e obtained from the support set S by Eq. 8 as sample vectors $e^s$, and the vectors e obtained from the query set Q as query vectors $e^q$. The routing module converts sample vectors $e^s_{ij}$ to prototype vectors $p_i$ through a nonlinear mapping, where $i = 1, \ldots, C$ and $j = 1, \ldots, K$.
Since we treat the flow as a string, and the order of the characters plays a crucial role in the model, we multiply all the sample vectors in the support set $e^s_{ij}$ with a transformation matrix $W_s \in \mathbb{R}^{2u \times 2u}$ and add a bias $b_s$. After iteration, a most representative linear mapping can be found. Each sample prediction vector $\hat{e}^s_{ij}$ is computed by
$$\hat{e}^s_{ij} = \mathrm{squash}(W_s e^s_{ij} + b_s)$$
where squash is defined as Eq. 10, which not only ensures that the length of the vector lies between 0 and 1 but also preserves the direction of the vector:
$$\mathrm{squash}(x) = \frac{\|x\|^2}{1 + \|x\|^2} \frac{x}{\|x\|} \quad (10)$$
In order to ensure that the prototype vector can automatically aggregate the sample feature vectors of its class when very little data is available, the dynamic routing mechanism is applied iteratively. At each iteration, the coupling coefficients $d_i$ for each class i are made to sum to 1 by applying a softmax to $b_i$:
$$d_i = \mathrm{softmax}(b_i)$$
where $b_i$ denotes the logits of the coupling coefficients, initialized to 0 in the first iteration. Given each sample prediction vector $\hat{e}^s_{ij}$, each candidate prototype vector $\hat{p}_i$ is a weighted sum of all sample prediction vectors $\hat{e}^s_{ij}$ in class i:
$$\hat{p}_i = \sum_{j} d_{ij} \hat{e}^s_{ij}$$
Then the squash function is applied to ensure that the length of the vector output by the routing process does not exceed 1:
$$p_i = \mathrm{squash}(\hat{p}_i)$$
The last step of each iteration is to adjust the logits of the coupling coefficients $b_{ij}$ by means of routing by agreement. If the sample prediction vector $\hat{e}^s_{ij}$ and the candidate prototype vector agree strongly, that is, the two are very similar, the coupling coefficient of that prediction vector is increased. After several iterations, a good coupling relationship and the final prototype vector are obtained.
Formally, the dynamic routing algorithm is shown in Algorithm 2.
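The routing procedure described above can be sketched for a single class as follows. This is a hedged illustration, not a transcription of Algorithm 2: the logit update uses the dot product between each prediction vector and the current prototype, which is the usual routing-by-agreement choice and is assumed here; the function names, the small `eps` term, and the random inputs are likewise assumptions.

```python
import numpy as np


def squash(v, eps=1e-9):
    """Shrink the length of v into [0, 1) while keeping its direction."""
    sq_norm = np.sum(v * v, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * v / np.sqrt(sq_norm + eps)


def softmax(x):
    x = x - x.max()
    exp = np.exp(x)
    return exp / exp.sum()


def route_prototype(e_s, W_s, b_s, n_iter=4):
    """e_s: (K, 2u) support vectors of one class. Returns the prototype p_i (2u,)
    and the coupling coefficients d_i (K,) used to build it."""
    e_hat = squash(e_s @ W_s.T + b_s)  # sample prediction vectors
    b = np.zeros(e_s.shape[0])         # coupling logits start at 0
    for _ in range(n_iter):
        d = softmax(b)                 # coefficients sum to 1
        p = squash(d @ e_hat)          # weighted sum, then squash
        b = b + e_hat @ p              # assumed agreement update (dot product)
    return p, d


rng = np.random.default_rng(0)
K, u = 5, 3
e_s = rng.normal(size=(K, 2 * u))
W_s = rng.normal(size=(2 * u, 2 * u))
b_s = np.zeros(2 * u)
prototype, coupling = route_prototype(e_s, W_s, b_s, n_iter=4)
```

With a single iteration all logits are 0, so the coefficients are uniform and the prototype reduces to a squashed mean of the prediction vectors; further iterations move weight toward the samples that agree with the emerging prototype.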

Relation module
We obtain the prototype vectors $p_i$ through the routing module described in Routing module and use the encoder module described in Encoder module to convert the samples in the query set into query vectors $e^q$. Then, we need to measure the connection between the query vectors $e^q$ and the prototype vectors $p_i$. We draw on the ideas in [21] and use a neural network to replace common mathematical distance metrics. Thus, the similarity between $p_i$ and $e^q$ is represented by a relation score between 0 and 1. The neural network consists of a neural tensor layer and a sigmoid layer. The neural tensor layer outputs a relation vector, in which the tensor parameters are applied one slice at a time and f is the ReLU activation function. The final relation score $r_{iq}$ between the i-th class and the q-th query is calculated by a fully connected layer activated by a sigmoid function.
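A minimal sketch of such a relation module is given below, assuming the common neural tensor layer form in which each slice of a parameter tensor defines one bilinear form between the prototype and the query. The names `M`, `w_r`, and `b_r`, the number of slices `h`, and the random values are hypothetical and chosen only for illustration.

```python
import numpy as np


def relation_score(p, e_q, M, w_r, b_r):
    """p, e_q: (2u,) prototype and query vectors; M: (h, 2u, 2u) tensor slices;
    w_r: (h,) and b_r: scalar parameters of the final fully connected layer."""
    v = np.einsum("i,kij,j->k", p, M, e_q)  # one bilinear form per slice
    v = np.maximum(v, 0.0)                  # ReLU gives the relation vector
    z = w_r @ v + b_r                       # fully connected layer
    return 1.0 / (1.0 + np.exp(-z))         # sigmoid, score in (0, 1)


rng = np.random.default_rng(0)
dim, h = 6, 4
p = 0.1 * rng.normal(size=dim)
e_q = 0.1 * rng.normal(size=dim)
M = rng.normal(size=(h, dim, dim))
w_r = rng.normal(size=h)
r = relation_score(p, e_q, M, w_r, b_r=0.0)
```

Unlike a fixed Euclidean or cosine comparator, every parameter in this scoring function is trained, so the notion of similarity is itself learned from the episodes.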

Objective function
When training the model, we use the mean square error as the loss function. The purpose is to transform the 2-way classification problem into a similarity regression problem, that is, a regression of the relation scores $r_{iq}$ onto the ground truth $y_q$. Given the support set $\mathcal{S}$ with 2 classes and the query set $\mathcal{Q}$ in an episode, the loss is the mean square error between the relation scores and the ground truth over all classes and query samples. All parameters of the 3 modules are trained jointly by backpropagation. Stochastic gradient descent is applied to all parameters in each training episode. Owing to its generalization ability, our model does not need any fine-tuning on classes it has never seen. The routing and comparison abilities accumulate in the model over the training episodes.
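The episode loss can be illustrated with a small worked example. The sketch assumes one-hot targets (1 when the class index equals the query label, 0 otherwise) and averages over all class-query pairs; whether the original formulation sums or averages is not specified here, so the normalization is an assumption, as are the function name and the example scores.

```python
def episode_mse_loss(scores, labels):
    """scores[q][i]: relation score r_iq of query q for class i;
    labels[q]: ground-truth class index y_q.
    Returns the mean squared error against one-hot targets."""
    total, count = 0.0, 0
    for r_q, y_q in zip(scores, labels):
        for i, r_iq in enumerate(r_q):
            target = 1.0 if i == y_q else 0.0
            total += (r_iq - target) ** 2
            count += 1
    return total / count


# Two queries in a 2-way episode (hypothetical scores).
scores = [[0.9, 0.2], [0.3, 0.6]]
labels = [0, 1]
loss = episode_mse_loss(scores, labels)
```

For these scores the squared errors are 0.01, 0.04, 0.09, and 0.16, so the mean is 0.075; a model that outputs exactly 1 for the true class and 0 otherwise would reach a loss of 0.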

Results and Discussion
In this section, 2 datasets are selected to evaluate the performance of our proposed CharNet by comparing it with big-data and few-shot network AD methods, respectively. After that, the function of each module is analyzed through an ablation study.

CICIDS2017FS
The CICIDS2017 dataset [33] contains benign traffic and the most up-to-date common attacks, and it resembles true real-world data (PCAPs). Each network flow has been labeled using CICFlowMeter, based on the time stamp, source and destination IPs, source and destination ports, protocols, and attack. The dataset includes the most common attacks based on the 2016 McAfee report, such as Web based, Brute force, DoS, DDoS, Infiltration, Heart-bleed, Bot, and Scan. We select 14 different types of attacks to form the datasets, 10 types of which are used as the base set and 4 types as the support set and query set. Then, Algorithm 1 is used to generate 2-way K-shot tasks from each set. The number of samples in CICIDS2017FS is shown in Table 1.

CM2021FS
This dataset was collected from a real server over one week and contains nearly 160,000 samples covering 6 different types of attacks as well as normal requests. The 6 classes are divided into 3 (Cloud Server request, blocked IP, crawler tool) for pretraining tasks and 3 (SQL injection, Directory traversal, XSS cross-site) for meta-learning tasks. We build 2-way K-shot learning models on this dataset. The number of samples in CM2021FS is shown in Table 2.

Architecture
We use Word2Vec [30] to pretrain the language coding layer on the dataset $\mathcal{D}_{base}$, where the dimension is 300 and the window size is 64. In the encoder module, we set the dimension of the LSTM hidden state to 128 and the dimension of the attention matrix to 64. In the routing module, the iteration number iter is 4. In the relation module, the output dimension is 100, and the activation function is ReLU. The output is then transformed into a score in [0, 1] through a sigmoid layer.

Training details
In the pretraining stage, we pretrain the word embedding on the base dataset $\mathcal{D}_{base}$ with Word2Vec [34]. In the meta-training stage, we train CharNet for 10,000 episodes on the support set of the train task set, $\mathcal{T}_{train}\langle\mathcal{S}\rangle$, in an episodic manner via stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0005. Then, we choose the model with the highest accuracy on the query set of the train task set, $\mathcal{T}_{train}\langle\mathcal{Q}\rangle$, as the final model. In the meta-test stage, we build 2-way K-shot (K ∈ {5, 10}) models on the 2 datasets to simulate the scenario of network AD.

Discussion of results
In this section, we provide an overview of several most recent deep learning and machine learning algorithms in network AD and discuss the detection results and the number of samples.Then, we compare the proposed routing network with other few-shot learning methods.After that, the ablation study is conducted to evaluate the effect of each module.

Comparison with big-data methods
In previous studies, many scholars have achieved excellent performance using machine learning and deep learning algorithms based on large amounts of data. To ensure the fairness of the experiments, we compare CharNet with existing studies that used the same public benchmark dataset, CICIDS2017. We also conduct few-shot experiments on the private dataset CM2021FS, where the precision and recall were 94.29% and 96.56% for K = 5, and 98.56% and 97.68% for K = 10. The results are shown in Table 3.
A noteworthy observation is that the overwhelming majority of studies, whether based on machine learning or deep learning, rely on "big data". As can be seen from Table 3, the accuracy and recall of these studies are almost all above 95%; for example, Multi-Stage Optimized ML-based AD (2021) [4] even achieves 99.99% accuracy and 99.00% recall. However, their sample sizes reach hundreds of thousands or even millions, which requires tremendous human effort to collect, process, and label the data manually. In addition, such methods make it difficult to quickly identify new and ever-changing attacks.
As shown in Table 3, CharNet (K = 5) only outperforms GA-based Adaptive Method [35], Multilayer ensemble SVM [5], and SU-AD [36] in both accuracy and recall, while CharNet (K = 10) outperforms all methods except GBT-based Big Data Method [37], Improved AdaBoost-based AD [6], and Deep Hierarchical AD [38]. With a slight decline of about 4% and 0.8% in accuracy and recall (K = 5), and about 0.1% and 0.02% in accuracy and recall (K = 10), CharNet requires far fewer training samples than these methods. It should be noted that the results obtained by CharNet rely on only 5 or 10 labeled samples in the training process, which drastically reduces the cost of data collection and manual labeling.

Comparison with few-shot learning methods
We compare CharNet with some state-of-the-art metric-based approaches to network AD, such as FC-Net, DF-Net, GP-Net, and FS-AD. FC-Net [18] is based on RelationNet to implement few-shot traffic classification and achieves good performance in its experimental setting. DF-Net [23] makes use of a siamese capsule network for AD with imbalanced training data. A relative position mechanism and a global-enhanced feature extractor are designed

To ensure fairness, we compare the performance of our method and the baseline methods on the same benchmark dataset, CICIDS2017FS; the results are shown in Table 4. To be consistent with the setting of FC-Net, we ran 2 experiments with K = 5 and K = 10. Our proposed network achieves the highest precision and recall in both the K = 5 and K = 10 experiments. In the 2-way 5-shot experiment, the accuracy of CharNet exceeds the state-of-the-art methods by 1.3% to 2.34%, and the recall exceeds them by 0.18% to 4.55%. In the 2-way 10-shot experiment, the accuracy exceeds the state-of-the-art methods by 2.36% to 4.48%, and the recall exceeds them by 0.81% to 3.58%. It is worth noting that our method beats FC-Net, which demonstrates that the routing-based prototype is more effective than the mean-based prototype.

Ablation study
The experiments are conducted on CICIDS2017FS and CM2021FS to evaluate the effect of each component, i.e., the encoder module, the routing module, the relation module, and pretraining. Specifically, (a) we change the backbone network of the encoder to explore the best feature extractor; (b) for the routing module, we change the number of iterations iter of the routing algorithm to find the best routing-based prototype on the different datasets and also compare it with the mean-based prototype; (c) we replace the relation module with the cosine distance metric and the Euclidean distance metric.

The effect of encoder module
In the experiment on the encoder module, different backbone networks are applied, such as biLSTM, LSTM, CNN, and Transformer. The results are shown in Table 5. It can be found that (a) the accuracy of biLSTM with attention outperforms the other 3 backbone networks on both datasets, while the Transformer is not far behind and achieves the best recall on CICIDS2017FS; (b) the attention mechanism helps biLSTM improve accuracy and recall by around 1%; (c) LSTM with attention is around 4% lower in precision and 3% lower in recall than biLSTM with attention, which means that informal text such as network traffic requires bidirectional semantics to better express the information in it.
Fig. 4 shows the t-distributed stochastic neighbor embedding (t-SNE) [39] visualization before and after the encoder module. We carry out the test task of CICIDS2017FS, which includes 4 categories: DDoS, Slowhttptest, FTP-Patator, and PortScan. We can see that the vectors after the encoder are more separable, demonstrating the effectiveness of the encoder in separating the solution space.

The effect of routing module
To explore the effect of the number of iterations iter on the routing module, we set iter from 1 to 6 on CICIDS2017FS and CM2021FS. In addition, the routing module is removed and replaced with the mean-based prototype. According to the results in Table 6, we observe that (a) the best performance is achieved with 4 and 5 iterations, and more rounds of iteration do not further improve the performance; (b) the best performance of the routing module exceeds the mean-based prototype by around 1%, which indicates that the routing module is effective.

The effect of relation module
For the relation module, we draw on the idea of the relation network [21] and use neural network training to obtain a learnable nonlinear similarity measure function, thereby constructing an end-to-end network structure. The experimental results of the relation module are shown in Table 7, from which we find that the relation module outperforms the cosine and Euclidean distance metrics on both datasets.

The effect of pretraining
To pursue faster training speed, we consider adding pretraining before the encoder module.Figure 5 shows the accuracy and loss curves with and without pretraining in CICIDS2017FS and CM2021FS, where iter is set to 4 and 5 respectively, and episode is set to 8,000.It can be seen intuitively that the convergence can be faster with pretraining, indicating that the pretraining can effectively extract the text information in the traffic log.

Conclusion
In this paper, we propose CharNet, a novel neural prototype for few-shot network AD. For this purpose, a basic binary classification task was defined, and pairs of network traffic samples, each including a normal unaffected sample and a malicious one, were constructed for learning. The routing module combines the dynamic routing algorithm with a meta-learning prototype, and the routing mechanism finds a more accurate prototype through multiple iterations, making our model more general in recognizing unseen classes. The experimental results show that the proposed model outperforms the existing state-of-the-art few-shot network AD models and needs only 5 or 10 samples to obtain a high-accuracy model. We found that both the pretraining and the encoder contribute tremendously to the few-shot learning tasks.

Fig. 1. The illustration of the division of meta-learning tasks.

Table 1. The number of samples in the CICIDS2017FS

Table 2. The number of samples in the CM2021FS

Table 3. Comparison of detection result and the number of samples in the proposed method and big-data methods

Table 4. Few-shot experimental result on CICIDS2017FS

Table 5. Ablation study with encoder module on 2-way 10-shot tasks

Table 6. Ablation study with routing module on 2-way 10-shot tasks

Table 7. Ablation study with relation module on 2-way 10-shot tasks